feat: Add session resilience and context budget management by mdear · Pull Request #9 · Intelligent-Internet/CommonGround

mdear · 2025-12-29T14:52:51Z

This commit introduces two major features: WebSocket session resilience for surviving temporary disconnects, and proactive context budget management to prevent context window overflow.

Session Resilience (Backend + Frontend)

JWT-based session security with HttpOnly fingerprint cookie binding (RFC 8725)
Dual heartbeat mechanism (server 30s ping + client 20s heartbeat)
120-second reconnection grace period with event buffering
Automatic token refresh at 90% of JWT lifespan
Run state preservation during disconnects

New files:

core/api/connection_manager.py - Connection state and event buffering
core/api/session_security.py - JWT lifecycle and fingerprint binding
frontend/lib/sessionManager.ts - Client-side session management
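
As a rough illustration of the reconnection grace period with event buffering listed above, the following is a minimal sketch and not the code in `core/api/connection_manager.py`. The class `BufferedSession`, its methods, and the `delivered` list (a stand-in for the WebSocket) are hypothetical names introduced here; only the 120-second figure comes from the description.

```python
import time
from collections import deque
from dataclasses import dataclass, field
from typing import Any, Deque, List, Optional

GRACE_PERIOD_SECONDS = 120.0  # grace period stated in the PR description


@dataclass
class BufferedSession:
    """Hypothetical per-session state, not the real connection_manager API."""

    connected: bool = True
    disconnected_at: Optional[float] = None
    buffer: Deque[Any] = field(default_factory=deque)
    delivered: List[Any] = field(default_factory=list)  # stand-in for the socket

    def _now(self, now: Optional[float]) -> float:
        # Allow an explicit clock value so the behaviour is easy to test.
        return time.monotonic() if now is None else now

    def expired(self, now: Optional[float] = None) -> bool:
        """True once a disconnected session has outlived the grace period."""
        if self.connected or self.disconnected_at is None:
            return False
        return self._now(now) - self.disconnected_at > GRACE_PERIOD_SECONDS

    def disconnect(self, now: Optional[float] = None) -> None:
        self.connected = False
        self.disconnected_at = self._now(now)

    def send(self, event: Any, now: Optional[float] = None) -> bool:
        """Deliver immediately when connected, otherwise buffer until expiry."""
        if self.connected:
            self.delivered.append(event)
            return True
        if self.expired(now):
            return False  # grace period over: the event is dropped
        self.buffer.append(event)
        return True

    def reconnect(self, now: Optional[float] = None) -> bool:
        """Replay buffered events in order if the client returns in time."""
        if self.expired(now):
            self.buffer.clear()
            return False
        self.connected = True
        self.disconnected_at = None
        while self.buffer:
            self.delivered.append(self.buffer.popleft())
        return True
```

Passing `now` explicitly keeps the sketch deterministic; a real implementation would more likely rely on the event loop clock and a timer that tears the session down when the grace period ends.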

Context Budget Management

Provider-aware token counting (Anthropic, OpenAI, Google APIs)
Circuit breaker pattern: warning at 40%, force completion at 55%
Budget-aware content selection for work module inheritance
Per-worker budget allocation for parallel execution

New files:

core/agent_core/llm/token_counter.py - Accurate provider-specific counting
core/agent_core/framework/context_budget_guardian.py - Threshold monitoring
core/agent_core/utils/content_selection.py - Budget-aware inheritance
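
The circuit breaker thresholds above (warning at 40%, force completion at 55%) could be expressed roughly as follows. This is a sketch under assumptions: `BudgetState` and `classify_budget` are hypothetical names, and treating each threshold as inclusive (`>=`) is a guess rather than something the description states.

```python
from enum import Enum

WARNING_RATIO = 0.40           # "warning at 40%" from the PR description
FORCE_COMPLETION_RATIO = 0.55  # "force completion at 55%" from the PR description


class BudgetState(Enum):
    """Hypothetical states, not the names used by context_budget_guardian.py."""

    OK = "ok"
    WARNING = "warning"
    FORCE_COMPLETION = "force_completion"


def classify_budget(used_tokens: int, context_window: int) -> BudgetState:
    """Map current token usage onto a circuit breaker state."""
    if context_window <= 0:
        raise ValueError("context_window must be positive")
    ratio = used_tokens / context_window
    # Check the most severe threshold first so it takes precedence.
    if ratio >= FORCE_COMPLETION_RATIO:
        return BudgetState.FORCE_COMPLETION
    if ratio >= WARNING_RATIO:
        return BudgetState.WARNING
    return BudgetState.OK
```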

Test Coverage

878 backend unit tests (33 new test files)
25 frontend tests for session management
Coverage for all new modules

Documentation

docs/architecture/session-resilience.md - Full design specification
docs/architecture/context-budget-management.md - Budget system design
docs/guides/04-debugging.md - Added CLI tools documentation
scripts/analyze_session.py - Session analysis utility
scripts/commonground.sh - Service manager script

Other Changes

Graceful shutdown with connection cleanup
Updated pyproject.toml with uv export instructions
Anthropic-specific LLM configs for accurate token budgeting
Agent profile updates for budget-aware operation


mdear · 2025-12-29T14:57:15Z

Hi team, here are some stability and resilience fixes I made to support integrating my own MCP server (a proprietary knowledge base for wheelchair seating and mobility, which can quickly overwhelm a model's context without proper controls).

My strengths lie mostly in backend infrastructure, so I kept my frontend changes light: just enough to give me the stability I needed to evaluate this solution properly.

I introduced unit test infrastructure that captures all current backend behavior.
I only did light unit testing on the frontend and would appreciate review from anyone with more frontend expertise than I have.

Respect! This is my way of showing in a (hopefully) useful way that I support you and what you are trying to do.

Any and all constructive criticism/review/suggestions are welcome.

…wareness

Context Budget System:
- Add context_admission_controller for pre-admission budget enforcement
- Add context_budget_handback for Principal-delegated summarization
- Update thresholds: WARNING 60%, CRITICAL 75%, EXCEEDED 85%
- Implement agent-type-aware forcing (Principal/Associate only)
- Partner agents receive guidance only (no flow-ending tools)

Orphan Detection:
- Add detect_orphaned_tool_interactions() to turn_manager
- Add finalize_orphaned_tool_interactions() for recovery
- Add detect_dispatch_anomalies() to dispatcher_node

Session Analysis (analyze_session.py):
- Add --mode handoff/thrashing/errors analysis modes
- Fix analyze_work_modules() to aggregate ALL context_archive entries
- Add dispatch_count tracking for thrashing detection
- Improve error detection to avoid false positives

Bug Fixes:
- Fix DuckDBRAGStore unawaited coroutine warning (lazy init)
- Rename test_jina_* to check_jina_* to avoid pytest auto-discovery
- Remove unused pythonjsonlogger import (deprecation warning)
- Fix sessionManager to always create fresh session_id for WS

Frontend:
- Increase node fallback dimensions for better visual fit
- Fix sessionManager reconnection flow

Docs:
- Update context-budget-management.md with implementation status

Tests: 934 passed, 1 skipped
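
To make the updated thresholds and the agent-type-aware forcing concrete, here is a small sketch. The names `BudgetLevel`, `budget_level`, and `action_for` are hypothetical, as are the returned action strings. The commit message does not say at which level forcing begins; this sketch assumes CRITICAL, which may not match the implementation.

```python
from enum import IntEnum


class BudgetLevel(IntEnum):
    """Hypothetical enum; the level names mirror the commit message."""

    OK = 0
    WARNING = 1
    CRITICAL = 2
    EXCEEDED = 3


# Thresholds from the commit message, checked from most to least severe.
THRESHOLDS = [
    (0.85, BudgetLevel.EXCEEDED),
    (0.75, BudgetLevel.CRITICAL),
    (0.60, BudgetLevel.WARNING),
]

# Only these agent types are forced to wrap up; Partner gets guidance only.
FORCEABLE_AGENT_TYPES = {"principal", "associate"}


def budget_level(ratio: float) -> BudgetLevel:
    """Return the budget level for a used/total context ratio."""
    for threshold, level in THRESHOLDS:
        if ratio >= threshold:
            return level
    return BudgetLevel.OK


def action_for(agent_type: str, ratio: float) -> str:
    """Pick an action; forcing from CRITICAL upward is an assumption."""
    level = budget_level(ratio)
    if level is BudgetLevel.OK:
        return "continue"
    if level is BudgetLevel.WARNING:
        return "warn"
    if agent_type.lower() in FORCEABLE_AGENT_TYPES:
        return "force_completion"
    return "guidance_only"
```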

Flow visualization improvements:
- Add dynamic minZoom that adapts to card count (see all cards at min zoom)
- Fix maxZoom at 1.5x for readable card text regardless of card count
- Align scroll wheel zoom speed between minimap and canvas (~9 clicks)
- Add translateExtent to constrain panning within node bounds
- Add status-based MiniMap colors (blue=running, green=success, red=error)

Scroll and layout fixes:
- Fix page-level scrolling by adding overflow:hidden to html/body/SidebarProvider
- Fix auto-scroll on page load (scrollIntoView block:'nearest')
- Add overscroll-contain to ChatHistory to prevent scroll chaining

Swim lane layout (flow-utils.ts):
- Rewrite layout algorithm for fixed-width swim lanes per agent
- Increase node fallback dimensions for better readability
- Add minimum dimension enforcement in getNodeSize()

Files changed:
- FlowView.tsx: zoom config, MiniMap styling, ReactFlowProvider wrapper
- ChatLayout.tsx: overflow-hidden on panels
- Workspace.tsx: overflow-hidden on container
- ChatHistory.tsx: overscroll-contain
- flow-utils.ts: swim lane algorithm
- globals.css: html/body overflow hidden
- layout.tsx: SidebarProvider height constraints
- r/page.tsx: scrollIntoView fix

Associates that output JSON deliverables without calling `finish_flow` would have their work lost, as the system only triggers deliverable extraction when `finish_flow` is invoked. Live session analysis revealed this caused re-dispatching.

The `generate_message_summary` instructional prompt told agents "DO NOT call any tools" after outputting JSON, but `finish_flow` IS required to trigger `_extract_deliverables_from_messages()` and capture the work.

- Updated instructional prompt to explicitly describe the 3-response sequence: generate_message_summary → JSON output → finish_flow
- Added critical warning about deliverable capture requirement
- Added "Finish Protocol" section documenting the completion sequence
- Updated self-reflection to detect JSON-without-finish_flow state
- Fixed observation text to match actual trigger conditions
- Added "CRITICAL CHECK" for JSON deliverable detection
- Updated instructions to guide agents through finish protocol
- Fixed incomplete sentence ("MUST synthesis" → proper guidance)
- Updated Deliver step to mention `finish_flow` requirement
- Analyzed production runs confirming the JSON → finish_flow sequence across all completed work modules
- All unit tests pass
- No regressions expected - changes are corrective/additive
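
The JSON-without-`finish_flow` state described above could be detected along these lines. This is only a sketch: the message shape (dicts with `role`, `content`, and `tool_calls` entries carrying a `name`) and the function `needs_finish_flow` are assumptions made for illustration, not the project's actual data model or self-reflection logic.

```python
import json
from typing import Any, Dict, List


def needs_finish_flow(messages: List[Dict[str, Any]]) -> bool:
    """True if the latest assistant message is JSON with no finish_flow call.

    The message layout used here is hypothetical.
    """
    for message in reversed(messages):
        if message.get("role") != "assistant":
            continue
        called = {call.get("name") for call in message.get("tool_calls", [])}
        if "finish_flow" in called:
            return False  # the deliverable will be captured
        content = message.get("content")
        text = content.strip() if isinstance(content, str) else ""
        if not text.startswith(("{", "[")):
            return False  # not a JSON deliverable
        try:
            json.loads(text)
        except json.JSONDecodeError:
            return False
        return True  # JSON deliverable emitted, finish_flow still missing
    return False
```

Only the most recent assistant message is inspected, on the assumption that the finish protocol expects `finish_flow` in the response immediately after the JSON output.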

Flow visualization now groups disconnected subgraphs into time-sorted epochs, ensuring timestamps always flow top-to-bottom (swimlane style).

Changes:
- Detect epochs via flood-fill of disconnected turn subgraphs
- Sort epochs by earliest timestamp for chronological ordering
- Add epoch separator nodes between epochs with proper labels
- Add "Epoch 1" header when multiple epochs exist
- Create edges connecting separators to adjacent epoch roots/leaves
- Filter Partner and user_turn before epoch detection
- Add epoch_separator nodeType to frontend FlowView component
- Update FlowViewModel documentation in API reference
- Add comprehensive unit tests for epoch detection logic

Fixes issue where cards from re-dispatched work modules appeared out of chronological order in the flow visualization.

- Detect disconnected subgraphs as epochs, sort by timestamp
- Add epoch separator nodes with edges to adjacent epochs
- Show "Epoch N" headers only when multiple epochs exist
- Filter Partner/user_turn before epoch detection
- Update frontend to render epoch_separator nodeType
- Update API docs for FlowViewModel epoch fields
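
The epoch detection described in the commit messages above amounts to finding connected components and ordering them by time. A minimal sketch follows, assuming turns are identified by string IDs with numeric timestamps and that edges are undirected pairs. The function name `detect_epochs` and this input shape are illustrative, not the project's FlowViewModel code.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, List, Set, Tuple


def detect_epochs(
    timestamps: Dict[str, float],
    edges: Iterable[Tuple[str, str]],
) -> List[List[str]]:
    """Group turns into epochs (connected components), earliest epoch first."""
    # Build an undirected adjacency map, ignoring edges to unknown turns.
    adjacency: Dict[str, Set[str]] = defaultdict(set)
    for a, b in edges:
        if a in timestamps and b in timestamps:
            adjacency[a].add(b)
            adjacency[b].add(a)

    seen: Set[str] = set()
    epochs: List[List[str]] = []
    for start in timestamps:
        if start in seen:
            continue
        # Flood-fill (breadth-first) from each turn not yet visited.
        seen.add(start)
        queue = deque([start])
        component: List[str] = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        # Order turns inside the epoch by time, with the ID as a tie-break.
        component.sort(key=lambda n: (timestamps[n], n))
        epochs.append(component)

    # Order epochs by their earliest timestamp.
    epochs.sort(key=lambda comp: timestamps[comp[0]])
    return epochs
```

Separator nodes and "Epoch N" headers would then be emitted between consecutive entries of the returned list, and only when it contains more than one epoch.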

Core Fixes:
- Add tool conflict resolution to prioritize finish_flow when called with other tools (prevents silent data loss from juicy-winged-adder)
- Add synchronous completion handler ensuring deliverables propagate to Partner inbox before session save
- Add /api/reports/{project_id}/{filename} endpoint for report downloads

Port Configuration:
- Consolidate all port config to core/.env as single source of truth
- Update commonground.sh to read BACKEND_PORT/FRONTEND_PORT from .env
- Update run_server.py to use .env defaults for host/port
- Update analysis scripts to read from .env instead of hardcoding

Pagination Enhancements:
- Add work module archive pagination to get_paginated_run_snapshot
- Support work_module_id and archive_index params for deep inspection
- Update message_handlers.py to pass new pagination params
- Update live_session_query.py to reconstruct archives with pagination

Frontend:
- Fix heartbeat/reconnect message format to use data wrapper
- Increase heartbeat tolerance (30s interval, 4 missed max)

Tests & Docs:
- Add test_tool_conflict_resolution.py (10 tests)
- Add TestTeamStatePagination tests (7 tests)
- Add context-management-fixes.md design document
- Restore DEFAULT_API_KEY/DEFAULT_BASE_URL in env.sample
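
One way to read "prioritize finish_flow when called with other tools" is sketched below. The function `resolve_tool_conflicts` and the tool call shape (dicts with a `name` key) are hypothetical, and dropping the competing calls entirely is an assumption; the actual resolution may instead reorder or defer them.

```python
from typing import Any, Dict, List

PRIORITY_TOOL = "finish_flow"  # tool name taken from the commit message


def resolve_tool_conflicts(tool_calls: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only the first finish_flow call when it competes with other tools.

    Assumption: competing calls are discarded so the flow ends cleanly and
    the deliverable is captured. Without finish_flow the list is unchanged.
    """
    for call in tool_calls:
        if call.get("name") == PRIORITY_TOOL:
            return [call]
    return list(tool_calls)
```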

…tion

- Add `allowed_at_critical` parameter to tool registry for budget-aware filtering
- User prompts (USER_PROMPT, PARTNER_DIRECTIVE, PRINCIPAL_COMPLETED) bypass circuit breaker to use reserved 15% headroom
- Restrict Partner tools to read-only at CRITICAL/EXCEEDED thresholds
- Mark flow-terminating tools (finish_flow, generate_message_summary) and GetPrincipalStatusSummaryTool as critical-safe
- Add filter_tools_for_critical_budget() in agent_strategy_helpers
- Add --output flag to analyze_session.py and live_session_query.py

Tests: 13 new tests for strategy helpers, 4 for guardian bypass, 6 for registry
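
The budget-aware tool filtering above might look roughly like this. The function name `filter_tools_for_critical_budget` and the `allowed_at_critical` flag come from the commit message, but the signature, the `ToolSpec` type, and the string level names are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Levels at which filtering applies, per the commit message.
RESTRICTED_LEVELS = {"CRITICAL", "EXCEEDED"}


@dataclass(frozen=True)
class ToolSpec:
    """Hypothetical registry entry; only allowed_at_critical is from the PR."""

    name: str
    allowed_at_critical: bool = False


def filter_tools_for_critical_budget(
    tools: Iterable[ToolSpec], level: str
) -> List[ToolSpec]:
    """Return every tool below CRITICAL, else only the critical-safe ones."""
    tools = list(tools)
    if level not in RESTRICTED_LEVELS:
        return tools
    return [tool for tool in tools if tool.allowed_at_critical]
```

For example, with `finish_flow` and `generate_message_summary` marked `allowed_at_critical=True` and a search tool left at the default, only the two flow-terminating tools would remain at the CRITICAL level.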

Myles Dear added 8 commits December 30, 2025 23:18

Remove local mcp server from mcp list.

e741259

mdear wants to merge 9 commits into Intelligent-Internet:main from mdear:feature/session-resilience-and-context-budget
